OCR exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Model based restoration of document images for OCR

Internal identifier: 002881 (Main/Exploration); previous: 002880; next: 002882

Authors: M. Y. Jaisimha [United States]; E. A. Riskin [United States]; R. Ladner [United States]; W. Stuetzle [United States]

Source: SPIE proceedings series (ISSN 1017-2653), 1996

RBID : Pascal:97-0010332

Abstract

This paper presents a methodology for model based restoration of degraded document imagery. The methodology has the advantages of being able to adapt to nonuniform page degradations and of being based on a model of image defects that is estimated directly from a set of calibrating degraded document images. Further, unlike other global filtering schemes, our methodology filters only words that have been misspelled by the OCR with a high probability. In the first stage of the process, we extract a training sample of candidate misspelled word subimages from the set of calibration images before and after the degradation that we wish to undo. These word subimages are registered to extract defect pixels. The second stage of our methodology uses a Vector Quantization based algorithm to construct a summary model of the defect pixels. The final stage of the algorithm uses the summary model to restore degraded document images. We evaluate the performance of the methodology for a variety of parameter settings on a real world sample of degraded FAX transmitted documents. The methodology eliminates up to 56.4% of the OCR character errors introduced as a result of FAX transmission for our sample experiment.
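
The abstract gives no implementation details, so the following is only a minimal sketch of how the first two stages might look, under stated assumptions: images are small binary grids that are already registered, defect pixels are taken to be the pixels that differ between the clean and degraded versions, and the summary model is a codebook trained with a generic generalized Lloyd (k-means style) vector quantizer. The function names, the 3x3 neighborhood, and the toy images are hypothetical and are not taken from the paper.

```python
import random


def defect_mask(clean, degraded):
    """Mark pixels that differ between two registered, same-size binary images."""
    return [[c ^ d for c, d in zip(crow, drow)]
            for crow, drow in zip(clean, degraded)]


def neighborhoods(image, mask, radius=1):
    """Flatten the (2r+1)x(2r+1) window of `image` around each marked pixel.

    Border pixels whose window would leave the image are skipped.
    """
    height, width = len(image), len(image[0])
    vectors = []
    for y in range(radius, height - radius):
        for x in range(radius, width - radius):
            if mask[y][x]:
                vectors.append(tuple(
                    image[y + dy][x + dx]
                    for dy in range(-radius, radius + 1)
                    for dx in range(-radius, radius + 1)))
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def train_codebook(vectors, k, iterations=20, seed=0):
    """Generalized Lloyd iteration: assign to nearest codeword, then re-average."""
    rng = random.Random(seed)
    codebook = [tuple(float(x) for x in v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        cells = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: sq_dist(v, codebook[i]))
            cells[nearest].append(v)
        for i, cell in enumerate(cells):
            if cell:  # an empty cell keeps its previous codeword
                codebook[i] = tuple(sum(col) / len(cell) for col in zip(*cell))
    return codebook


if __name__ == "__main__":
    # Toy 5x5 "word subimage" before and after a made-up degradation.
    clean = [[0, 0, 0, 0, 0],
             [0, 1, 1, 1, 0],
             [0, 1, 0, 1, 0],
             [0, 1, 1, 1, 0],
             [0, 0, 0, 0, 0]]
    degraded = [[0, 0, 0, 0, 0],
                [0, 1, 0, 1, 0],
                [0, 1, 1, 1, 0],
                [0, 1, 1, 0, 0],
                [0, 0, 0, 0, 0]]
    mask = defect_mask(clean, degraded)
    samples = neighborhoods(degraded, mask)
    print(train_codebook(samples, k=2))
```

In this sketch the codebook plays the role of the "summary model": each codeword is an average degraded neighborhood around a defect pixel. How the paper actually uses its model in the restoration stage is not described in the abstract and is not reproduced here.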



The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Model based restoration of document images for OCR</title>
<author>
<name sortKey="Jaisimha, M Y" sort="Jaisimha, M Y" uniqKey="Jaisimha M" first="M. Y." last="Jaisimha">M. Y. Jaisimha</name>
<affiliation wicri:level="2">
<inist:fA14 i1="01">
<s1>MathSoft, Inc., 1700 Westlake Ave. N, Suite 500</s1>
<s2>Seattle, WA 98109</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Washington (État)</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Riskin, E A" sort="Riskin, E A" uniqKey="Riskin E" first="E. A." last="Riskin">E. A. Riskin</name>
<affiliation wicri:level="4">
<inist:fA14 i1="02">
<s1>University of Washington</s1>
<s2>Seattle, WA 98195</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Washington (État)</region>
<settlement type="city">Seattle</settlement>
</placeName>
<orgName type="university">Université de Washington</orgName>
</affiliation>
</author>
<author>
<name sortKey="Ladner, R" sort="Ladner, R" uniqKey="Ladner R" first="R." last="Ladner">R. Ladner</name>
<affiliation wicri:level="4">
<inist:fA14 i1="02">
<s1>University of Washington</s1>
<s2>Seattle, WA 98195</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Washington (État)</region>
<settlement type="city">Seattle</settlement>
</placeName>
<orgName type="university">Université de Washington</orgName>
</affiliation>
</author>
<author>
<name sortKey="Stuetzle, W" sort="Stuetzle, W" uniqKey="Stuetzle W" first="W." last="Stuetzle">W. Stuetzle</name>
<affiliation wicri:level="4">
<inist:fA14 i1="02">
<s1>University of Washington</s1>
<s2>Seattle, WA 98195</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Washington (État)</region>
<settlement type="city">Seattle</settlement>
</placeName>
<orgName type="university">Université de Washington</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">97-0010332</idno>
<date when="1996">1996</date>
<idno type="stanalyst">PASCAL 97-0010332 INIST</idno>
<idno type="RBID">Pascal:97-0010332</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000977</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000A21</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000942</idno>
<idno type="wicri:doubleKey">1017-2653:1996:Jaisimha M:model:based:restoration</idno>
<idno type="wicri:Area/Main/Merge">002A29</idno>
<idno type="wicri:Area/Main/Curation">002881</idno>
<idno type="wicri:Area/Main/Exploration">002881</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">Model based restoration of document images for OCR</title>
<author>
<name sortKey="Jaisimha, M Y" sort="Jaisimha, M Y" uniqKey="Jaisimha M" first="M. Y." last="Jaisimha">M. Y. Jaisimha</name>
<affiliation wicri:level="2">
<inist:fA14 i1="01">
<s1>MathSoft, Inc., 1700 Westlake Ave. N, Suite 500</s1>
<s2>Seattle, WA 98109</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Washington (État)</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Riskin, E A" sort="Riskin, E A" uniqKey="Riskin E" first="E. A." last="Riskin">E. A. Riskin</name>
<affiliation wicri:level="4">
<inist:fA14 i1="02">
<s1>University of Washington</s1>
<s2>Seattle, WA 98195</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Washington (État)</region>
<settlement type="city">Seattle</settlement>
</placeName>
<orgName type="university">Université de Washington</orgName>
</affiliation>
</author>
<author>
<name sortKey="Ladner, R" sort="Ladner, R" uniqKey="Ladner R" first="R." last="Ladner">R. Ladner</name>
<affiliation wicri:level="4">
<inist:fA14 i1="02">
<s1>University of Washington</s1>
<s2>Seattle, WA 98195</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Washington (État)</region>
<settlement type="city">Seattle</settlement>
</placeName>
<orgName type="university">Université de Washington</orgName>
</affiliation>
</author>
<author>
<name sortKey="Stuetzle, W" sort="Stuetzle, W" uniqKey="Stuetzle W" first="W." last="Stuetzle">W. Stuetzle</name>
<affiliation wicri:level="4">
<inist:fA14 i1="02">
<s1>University of Washington</s1>
<s2>Seattle, WA 98195</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Washington (État)</region>
<settlement type="city">Seattle</settlement>
</placeName>
<orgName type="university">Université de Washington</orgName>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">SPIE proceedings series</title>
<idno type="ISSN">1017-2653</idno>
<imprint>
<date when="1996">1996</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">SPIE proceedings series</title>
<idno type="ISSN">1017-2653</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">This paper presents a methodology for model based restoration of degraded document imagery. The methodology has the advantages of being able to adapt to nonuniform page degradations and of being based on a model of image defects that is estimated directly from a set of calibrating degraded document images. Further, unlike other global filtering schemes, our methodology filters only words that have been misspelled by the OCR with a high probability. In the first stage of the process, we extract a training sample of candidate misspelled word subimages from the set of calibration images before and after the degradation that we wish to undo. These word subimages are registered to extract defect pixels. The second stage of our methodology uses a Vector Quantization based algorithm to construct a summary model of the defect pixels. The final stage of the algorithm uses the summary model to restore degraded document images. We evaluate the performance of the methodology for a variety of parameter settings on a real world sample of degraded FAX transmitted documents. The methodology eliminates up to 56.4% of the OCR character errors introduced as a result of FAX transmission for our sample experiment.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>États-Unis</li>
</country>
<region>
<li>Washington (État)</li>
</region>
<settlement>
<li>Seattle</li>
</settlement>
<orgName>
<li>Université de Washington</li>
</orgName>
</list>
<tree>
<country name="États-Unis">
<region name="Washington (État)">
<name sortKey="Jaisimha, M Y" sort="Jaisimha, M Y" uniqKey="Jaisimha M" first="M. Y." last="Jaisimha">M. Y. Jaisimha</name>
</region>
<name sortKey="Ladner, R" sort="Ladner, R" uniqKey="Ladner R" first="R." last="Ladner">R. Ladner</name>
<name sortKey="Riskin, E A" sort="Riskin, E A" uniqKey="Riskin E" first="E. A." last="Riskin">E. A. Riskin</name>
<name sortKey="Stuetzle, W" sort="Stuetzle, W" uniqKey="Stuetzle W" first="W." last="Stuetzle">W. Stuetzle</name>
</country>
</tree>
</affiliations>
</record>
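
The record above uses the `wicri:` and `inist:` prefixes without declaring them, so a namespace-aware parser is likely to reject it as written. The sketch below shows one possible workaround in Python: wrap the record in an element that binds those prefixes to placeholder URIs, then read the authors from the `<analytic>` block. The `urn:example:...` URIs, the `wrapper` element, and the function names are assumptions made for illustration; they are not part of Dilib or Wicri. `SAMPLE` is a trimmed excerpt of the record, not the full document.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace bindings: the real URIs are not given in the record.
NS_DECLS = 'xmlns:wicri="urn:example:wicri" xmlns:inist="urn:example:inist"'


def parse_record(record_xml):
    """Parse a <record> fragment whose wicri:/inist: prefixes are undeclared.

    Assumes `record_xml` has no XML declaration, since it is embedded
    inside a wrapper element before parsing.
    """
    wrapped = f"<wrapper {NS_DECLS}>{record_xml}</wrapper>"
    return ET.fromstring(wrapped).find("record")


def analytic_authors(record):
    """Return (last, first, country) for each author in the <analytic> block."""
    authors = []
    for author in record.findall(".//analytic/author"):
        name = author.find("name")
        country = author.findtext("affiliation/country")
        authors.append((name.get("last"), name.get("first"), country))
    return authors


# Trimmed excerpt of the record shown above.
SAMPLE = """<record><TEI><teiHeader><fileDesc><sourceDesc><biblStruct><analytic>
<author>
<name sortKey="Jaisimha, M Y" first="M. Y." last="Jaisimha">M. Y. Jaisimha</name>
<affiliation wicri:level="2">
<inist:fA14 i1="01"><s3>USA</s3></inist:fA14>
<country>États-Unis</country>
</affiliation>
</author>
</analytic></biblStruct></sourceDesc></fileDesc></teiHeader></TEI></record>"""

if __name__ == "__main__":
    for last, first, country in analytic_authors(parse_record(SAMPLE)):
        print(last, first, country)
```

Restricting the search to `.//analytic/author` avoids counting each author twice, since the same author entries also appear under `<titleStmt>`.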

To work with this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 002881 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 002881 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:97-0010332
   |texte=   Model based restoration of document images for OCR
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024